Problem Statement¶

To detect the voice of male or female based on given voice parameters. Create a model that can classify the given voice is of male or female. We have well labled dataset which has voice features which are related to male or female voice.

In [2]:
# Import necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import svm
import seaborn as sns
%matplotlib inline
In [3]:
# read the dataset
data = pd.read_csv('voice.csv')
data.head()
Out[3]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male

5 rows × 21 columns

In [4]:
data.shape
Out[4]:
(3168, 21)

Dataset is consists of 3168 rows and 20 feature with one target.

Lets unserstand more about data

In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   meanfreq  3168 non-null   float64
 1   sd        3168 non-null   float64
 2   median    3168 non-null   float64
 3   Q25       3168 non-null   float64
 4   Q75       3168 non-null   float64
 5   IQR       3168 non-null   float64
 6   skew      3168 non-null   float64
 7   kurt      3168 non-null   float64
 8   sp.ent    3168 non-null   float64
 9   sfm       3168 non-null   float64
 10  mode      3168 non-null   float64
 11  centroid  3168 non-null   float64
 12  meanfun   3168 non-null   float64
 13  minfun    3168 non-null   float64
 14  maxfun    3168 non-null   float64
 15  meandom   3168 non-null   float64
 16  mindom    3168 non-null   float64
 17  maxdom    3168 non-null   float64
 18  dfrange   3168 non-null   float64
 19  modindx   3168 non-null   float64
 20  label     3168 non-null   object 
dtypes: float64(20), object(1)
memory usage: 519.9+ KB
In [10]:
data.describe()
Out[10]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm mode centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx
count 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000 3168.000000
mean 0.180907 0.057126 0.185621 0.140456 0.224765 0.084309 3.140168 36.568461 0.895127 0.408216 0.165282 0.180907 0.142807 0.036802 0.258842 0.829211 0.052647 5.047277 4.994630 0.173752
std 0.029918 0.016652 0.036360 0.048680 0.023639 0.042783 4.240529 134.928661 0.044980 0.177521 0.077203 0.029918 0.032304 0.019220 0.030077 0.525205 0.063299 3.521157 3.520039 0.119454
min 0.039363 0.018363 0.010975 0.000229 0.042946 0.014558 0.141735 2.068455 0.738651 0.036876 0.000000 0.039363 0.055565 0.009775 0.103093 0.007812 0.004883 0.007812 0.000000 0.000000
25% 0.163662 0.041954 0.169593 0.111087 0.208747 0.042560 1.649569 5.669547 0.861811 0.258041 0.118016 0.163662 0.116998 0.018223 0.253968 0.419828 0.007812 2.070312 2.044922 0.099766
50% 0.184838 0.059155 0.190032 0.140286 0.225684 0.094280 2.197101 8.318463 0.901767 0.396335 0.186599 0.184838 0.140519 0.046110 0.271186 0.765795 0.023438 4.992188 4.945312 0.139357
75% 0.199146 0.067020 0.210618 0.175939 0.243660 0.114175 2.931694 13.648905 0.928713 0.533676 0.221104 0.199146 0.169581 0.047904 0.277457 1.177166 0.070312 7.007812 6.992188 0.209183
max 0.251124 0.115273 0.261224 0.247347 0.273469 0.252225 34.725453 1309.612887 0.981997 0.842936 0.280000 0.251124 0.237636 0.204082 0.279114 2.957682 0.458984 21.867188 21.843750 0.932374

EDA¶

In [11]:
import pandas_profiling as pp
In [12]:
pp.ProfileReport(data)
Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]
Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]
Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
Out[12]:

From EDA It is very clear that we have 0.1% of duplicate data which is very less but we have to remove duplicate rows. We have correlated data as well. So highly correlated feature may create problem. All the features are not null and numerical. Only target 'label' is categorical data so we have to do encoding for target variable. Features like kurt and skew are having large values than other features so these features are needs to be normalized or standardize.

In [14]:
# Remove duplicate records
data.drop_duplicates(keep='first')
Out[14]:
meanfreq sd median Q25 Q75 IQR skew kurt sp.ent sfm ... centroid meanfun minfun maxfun meandom mindom maxdom dfrange modindx label
0 0.059781 0.064241 0.032027 0.015071 0.090193 0.075122 12.863462 274.402906 0.893369 0.491918 ... 0.059781 0.084279 0.015702 0.275862 0.007812 0.007812 0.007812 0.000000 0.000000 male
1 0.066009 0.067310 0.040229 0.019414 0.092666 0.073252 22.423285 634.613855 0.892193 0.513724 ... 0.066009 0.107937 0.015826 0.250000 0.009014 0.007812 0.054688 0.046875 0.052632 male
2 0.077316 0.083829 0.036718 0.008701 0.131908 0.123207 30.757155 1024.927705 0.846389 0.478905 ... 0.077316 0.098706 0.015656 0.271186 0.007990 0.007812 0.015625 0.007812 0.046512 male
3 0.151228 0.072111 0.158011 0.096582 0.207955 0.111374 1.232831 4.177296 0.963322 0.727232 ... 0.151228 0.088965 0.017798 0.250000 0.201497 0.007812 0.562500 0.554688 0.247119 male
4 0.135120 0.079146 0.124656 0.078720 0.206045 0.127325 1.101174 4.333713 0.971955 0.783568 ... 0.135120 0.106398 0.016931 0.266667 0.712812 0.007812 5.484375 5.476562 0.208274 male
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3163 0.131884 0.084734 0.153707 0.049285 0.201144 0.151859 1.762129 6.630383 0.962934 0.763182 ... 0.131884 0.182790 0.083770 0.262295 0.832899 0.007812 4.210938 4.203125 0.161929 female
3164 0.116221 0.089221 0.076758 0.042718 0.204911 0.162193 0.693730 2.503954 0.960716 0.709570 ... 0.116221 0.188980 0.034409 0.275862 0.909856 0.039062 3.679688 3.640625 0.277897 female
3165 0.142056 0.095798 0.183731 0.033424 0.224360 0.190936 1.876502 6.604509 0.946854 0.654196 ... 0.142056 0.209918 0.039506 0.275862 0.494271 0.007812 2.937500 2.929688 0.194759 female
3166 0.143659 0.090628 0.184976 0.043508 0.219943 0.176435 1.591065 5.388298 0.950436 0.675470 ... 0.143659 0.172375 0.034483 0.250000 0.791360 0.007812 3.593750 3.585938 0.311002 female
3167 0.165509 0.092884 0.183044 0.070072 0.250827 0.180756 1.705029 5.769115 0.938829 0.601529 ... 0.165509 0.185607 0.062257 0.271186 0.227022 0.007812 0.554688 0.546875 0.350000 female

3166 rows × 21 columns

In [15]:
# Check if the data is balanced or not as we are dealing with binary classification problem
print('Total no of rows: %d'%data.shape[0])
print('Total no of male: %d'%(data[data['label']=='male'].shape[0]))
print('Total no of female: %d'%(data[data['label']=='female'].shape[0]))
Total no of rows: 3168
Total no of male: 1584
Total no of female: 1584

Data is balanced.

Feature Selection¶

In [16]:
features = data.iloc[:,:-1]
target = data.iloc[:,-1:]

Scaling and Encoding¶

In [17]:
# Encode the categorical target variable into 1 and 0.
from sklearn.preprocessing import LabelEncoder
gender_encoder = LabelEncoder()
target = gender_encoder.fit_transform(target)
target
C:\Users\ankush.ramrao.yadav\Anaconda3\lib\site-packages\sklearn\preprocessing\_label.py:116: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[17]:
array([1, 1, 1, ..., 0, 0, 0])
In [18]:
# Split the data into train and test before the scaling to avoid data leakage.
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.2, random_state=40, shuffle=True)
In [21]:
# scale the data between -1 to 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
print(X_train)
#Scale test data as well
scaler.fit(X_test)
X_test = scaler.transform(X_test)
print(X_test)
[[-1.05464063  1.2019148  -0.45399449 ...  0.2484414   0.2610299
  -0.10762618]
 [ 1.00801451 -0.26704046  0.92461833 ...  0.71653546  0.72482233
  -0.71368778]
 [ 0.81528778  0.03185805  0.95648174 ... -1.18454467 -1.1855608
   0.82090761]
 ...
 [ 0.29931529 -0.83034082  0.32454203 ...  1.66376354  1.67228403
  -0.70925269]
 [ 0.11519245 -0.74361385 -0.12248314 ...  0.54872815  0.51280293
  -0.45037449]
 [ 1.49331404 -1.24960049  1.25237441 ...  1.2133334   1.2217428
  -0.46102291]]
[[ 0.82272861 -1.35049362  0.58377569 ...  0.45230837  0.46167023
  -0.6520387 ]
 [-1.35776275  1.20629184 -1.4716476  ...  0.08305213  0.09671652
  -0.4741144 ]
 [-0.09698037  0.24556463 -0.14786725 ... -0.13215857 -0.12316243
  -0.51966465]
 ...
 [ 0.36394862  0.23906265  0.67734322 ...  0.46136987  0.43446871
   0.29656335]
 [ 0.62685582 -0.05165681  0.82929923 ... -1.35546145 -1.38576626
  -0.4428497 ]
 [ 1.39231868 -1.32286311  1.06489015 ...  0.04454074  0.05364745
  -0.34589502]]

SVM with Default Hyper-Parameter¶

In [23]:
from sklearn import metrics
svc = svm.SVC()
svc.fit(X_train,Y_train)
Y_pred = svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(Y_test,Y_pred))
Accuracy Score:
0.9779179810725552

Linear Kernal¶

In [24]:
svc = svm.SVC(kernel='linear')
svc.fit(X_train,Y_train)
Y_pred = svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(Y_test,Y_pred))
Accuracy Score:
0.9747634069400631

Ploynomial Kernal¶

In [25]:
svc = svm.SVC(kernel='poly')
svc.fit(X_train,Y_train)
Y_pred = svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(Y_test,Y_pred))
Accuracy Score:
0.9684542586750788

Optimization of HyperParameter C¶

In [29]:
# cross-validation import from sklearn
from sklearn.model_selection import cross_val_score

C_range = list(range(1,30))
roc_auc_score=[]
for c in C_range:
    svc = svm.SVC(kernel='rbf',C=c)
    scores = cross_val_score(svc,X_train,Y_train,cv=10,scoring='roc_auc')
    roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.9963979022153481, 0.9965847030839896, 0.9967279570522436, 0.9968650098425196, 0.9969833118516436, 0.997132812695288, 0.9972071198131482, 0.9972877249718785, 0.9973747273778277, 0.9974118793744532, 0.9974553813585801, 0.9974677325490564, 0.9974552325490563, 0.9973616313585802, 0.9974178813585801, 0.9974240337145357, 0.9974178829208847, 0.9973743305524309, 0.997436681742907, 0.9974365825365579, 0.9974616809617547, 0.997492979783777, 0.9975490290276217, 0.9975552782464693, 0.9975428766716661, 0.997574174712536, 0.9975741739313836, 0.9975866739313837, 0.9975742731377328]
In [32]:
plt.figure(figsize=(10,6))
C_values = list(range(1,30))
# plot C values in X-axis and cross_validate_accuracy on y-axis
plt.plot(C_values,roc_auc_score)
plt.xticks(np.arange(0,30,1))
plt.xlabel('Value of C for SVM')
plt.ylabel('Cross-Validate roc_auc_score')
Out[32]:
Text(0, 0.5, 'Cross-Validate roc_auc_score')

We are getting highest and stable roc auc score for value C=27¶

Optimization the Hyper-Parameter Gamma¶

In [34]:
gamma_range = [0.0001,0.001,0.01,0.1,1,10,100]
roc_auc_score = []
for g in gamma_range:
    svc = svm.SVC(kernel='rbf',gamma=g)
    scores = cross_val_score(svc,X_train,Y_train,cv=10,scoring='roc_auc')
    roc_auc_score.append(scores.mean())
print(roc_auc_score)
[0.9057395848956382, 0.9934657218628923, 0.995506239454443, 0.9970388224909387, 0.995625315976128, 0.9898582880264966, 0.5824438101487315]
In [35]:
%matplotlib inline
plt.figure(figsize=(20,6))
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]

# plot the value of C for SVM (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(gamma_range,roc_auc_score)
plt.xlabel('Value of gamma for SVC ')
plt.xticks(np.arange(0.0001,100,5))
plt.ylabel('Cross-Validated roc_auc_score')
Out[35]:
Text(0, 0.5, 'Cross-Validated roc_auc_score')

We are getting highest score at gamma = 0.1¶

In [49]:
svc=svm.SVC(kernel='rbf', C=27).fit(X_train, Y_train)
In [50]:
Y_pred = svc.predict(X_test)
print('Accuracy Score:')
print(metrics.accuracy_score(Y_test,Y_pred))
Accuracy Score:
0.973186119873817
In [51]:
# create confusion matrix for evaluation.
from sklearn.metrics import confusion_matrix,classification_report,recall_score,roc_curve,precision_score,roc_auc_score
cnf_matrix = confusion_matrix(Y_test,Y_pred)
cnf_matrix
Out[51]:
array([[298,   7],
       [ 10, 319]], dtype=int64)
In [52]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[52]:
Text(0.5, 427.9555555555555, 'Predicted label')
In [53]:
# get Recall scores
recall = recall_score(Y_test,Y_pred)
recall
Out[53]:
0.9696048632218845
In [54]:
#get Precision score
precision = precision_score(Y_test,Y_pred)
precision
Out[54]:
0.9785276073619632
In [55]:
# create ROC curve
fpr, tpr, _ = roc_curve(Y_test,  Y_pred)
auc = roc_auc_score(Y_test, Y_pred)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()

Conclusion¶

We are getting ROC auc = 0.97 and precision = 97.8% which is quite good. Current model can predict the voice as MALE or FEMALE upto 97% accuracy.